Algorithms for Accelerating Machine Learning Using Wide and Deep Models
Doctoral thesis (Doctor of Informatics), Kyoto University, Kou No. 23310, Joho No. 746, call number 新制||情||127 (University Library). Graduate School of Informatics, Department of Intelligence Science and Technology, Kyoto University. Examination committee: Prof. Hisashi Kashima (chief examiner), Prof. Toshiyuki Tanaka, Prof. Nobuo Yamashita. Conferred under Article 4, Paragraph 1 of the Degree Regulations.
Fast Saturating Gate for Learning Long Time Scales with Recurrent Neural Networks
Gate functions in recurrent models such as LSTMs and GRUs play a central role in learning the various time scales of time series data by using a bounded activation function. However, it is difficult to train gates to capture extremely long time scales because the gradient of the bounded function vanishes for large inputs, which is known as the saturation problem. We closely analyze the relation between saturation of the gate function and training efficiency. We prove that the gradient vanishing of the gate function can be mitigated by accelerating the convergence of the saturating function, i.e., making the output of the function converge to 0 or 1 faster. Based on this analysis, we propose a gate function called the fast gate that has a doubly exponential convergence rate with respect to its inputs, obtained by simple function composition. We empirically show that our method outperforms previous methods in accuracy and computational efficiency on benchmark tasks involving extremely long time scales.
Comment: 9 pages of main text with 4 pages of appendices, 12 figures
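The effect of accelerating saturation by function composition can be illustrated numerically. In the sketch below, the composed gate (a sigmoid applied to a hyperbolic-sine expansion of the input) is an assumption chosen purely for illustration and is not necessarily the exact fast gate of the paper; it only shows how composing the gate with a superlinearly growing function makes the output approach 0 or 1 doubly exponentially fast.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def composed_gate(x):
    # Illustrative composition (assumption, not the paper's exact fast gate):
    # sinh grows exponentially, so the pre-activation of the sigmoid grows
    # exponentially and the gate output saturates doubly exponentially fast.
    return sigmoid(np.sinh(x))

if __name__ == "__main__":
    for x in [1.0, 2.0, 4.0, 8.0]:
        # Distance from full saturation shrinks much faster for the composed gate.
        print(f"x={x:4.1f}  1-sigmoid(x)={1 - sigmoid(x):.3e}  "
              f"1-composed_gate(x)={1 - composed_gate(x):.3e}")
```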
Smoothness Analysis of Adversarial Training
Deep neural networks are vulnerable to adversarial attacks. Recent studies of adversarial robustness focus on the loss landscape in the parameter space, since it is related to optimization and generalization performance. These studies conclude that the difficulty of adversarial training is caused by the non-smoothness of the loss function, i.e., its gradient is not Lipschitz continuous. However, this analysis ignores the dependence of adversarial attacks on the model parameters. Since adversarial attacks are optimized for the model, they should depend on its parameters. Taking this dependence into account, we analyze the smoothness of the loss function of adversarial training using the optimal attacks for the model parameters in more detail. We reveal that the constraint on adversarial attacks is one cause of the non-smoothness and that the smoothness depends on the type of constraint. Specifically, the $L_\infty$ constraint can cause non-smoothness more than the $L_2$ constraint. Moreover, our analysis implies that if we flatten the loss function with respect to the input data, the Lipschitz constant of the gradient of the adversarial loss tends to increase. To address the non-smoothness, we show that EntropySGD smooths the non-smooth loss and improves the performance of adversarial training.
Comment: 22 pages, 7 figures. In v3, we add the results of EntropySGD for adversarial training
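As a rough illustration of how Entropy-SGD smooths a non-smooth objective, the following sketch runs the SGLD-style inner loop of Chaudhari et al.'s Entropy-SGD and uses the resulting local-entropy gradient to update the parameters. The loss-gradient callback `grad_loss` and all hyperparameter values are placeholders; this is a minimal sketch of the optimizer referenced above, not a reproduction of the paper's adversarial-training setup.

```python
import numpy as np

def entropy_sgd_step(theta, grad_loss, gamma=0.03, inner_lr=0.1, outer_lr=0.1,
                     n_inner=5, noise=1e-4, alpha=0.75, rng=None):
    """One Entropy-SGD update (Chaudhari et al., 2017) applied to parameters `theta`.

    `grad_loss(theta)` returns the gradient of the (possibly non-smooth) loss,
    e.g. an adversarial loss; all hyperparameter values here are placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta_prime = theta.copy()
    mu = theta.copy()
    for _ in range(n_inner):
        # SGLD step exploring the loss surface in a neighborhood of theta.
        g = grad_loss(theta_prime) - gamma * (theta - theta_prime)
        theta_prime = (theta_prime - inner_lr * g
                       + np.sqrt(inner_lr) * noise * rng.standard_normal(theta.shape))
        mu = (1.0 - alpha) * mu + alpha * theta_prime
    # The local-entropy gradient is proportional to gamma * (theta - mu).
    return theta - outer_lr * gamma * (theta - mu)
```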
Switching One-Versus-the-Rest Loss to Increase the Margin of Logits for Adversarial Robustness
Adversarial training is a promising method for improving robustness against adversarial attacks. To enhance its performance, recent methods put high weights on the cross-entropy loss for important data points near the decision boundary. However, these importance-aware methods are vulnerable to sophisticated attacks such as Auto-Attack. In this paper, we experimentally investigate the cause of their vulnerability through the margins between the logit of the true label and the other logits, since these margins should be large enough to prevent the largest logit from being flipped by an attack. Our experiments reveal that the histogram of logit margins under naive adversarial training has two peaks. Thus, the difficulty of increasing logit margins roughly divides the samples into two groups: difficult samples (small logit margins) and easy samples (large logit margins). In contrast, only one peak near zero appears in the histogram of importance-aware methods, i.e., they reduce the logit margins of easy samples. To increase the logit margins of difficult samples without reducing those of easy samples, we propose the switching one-versus-the-rest loss (SOVR), which switches from cross-entropy to the one-versus-the-rest loss (OVR) for difficult samples. We derive trajectories of logit margins for a simple problem and prove that OVR increases logit margins twice as much as the weighted cross-entropy loss. Thus, SOVR increases the logit margins of difficult samples, unlike existing methods. We experimentally show that SOVR achieves better robustness against Auto-Attack than importance-aware methods.
Comment: 25 pages, 18 figures
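A hedged sketch of the switching idea: compute the logit margin per sample and apply the one-versus-the-rest loss to samples whose margin falls below a threshold, and cross-entropy to the rest. The threshold `margin_threshold` and the exact OVR form used below (a sum of per-class sigmoid binary cross-entropies) are assumptions for illustration; the paper's precise switching rule may differ.

```python
import torch
import torch.nn.functional as F

def sovr_loss(logits, targets, margin_threshold=0.0):
    """Switching one-versus-the-rest loss (illustrative sketch).

    logits: (batch, num_classes); targets: (batch,) integer labels.
    `margin_threshold` is a placeholder hyperparameter.
    """
    true_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, targets.unsqueeze(1), float("-inf"))
    margin = true_logit - others.max(dim=1).values  # logit margin per sample

    ce = F.cross_entropy(logits, targets, reduction="none")
    # One-versus-the-rest: per-class binary cross-entropy against one-hot targets.
    one_hot = F.one_hot(targets, logits.size(1)).float()
    ovr = F.binary_cross_entropy_with_logits(logits, one_hot, reduction="none").sum(dim=1)

    # Use OVR for difficult samples (small margin), cross-entropy for easy ones.
    return torch.where(margin < margin_threshold, ovr, ce).mean()
```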
Absum: Simple Regularization Method for Reducing Structural Sensitivity of Convolutional Neural Networks
We propose Absum, a regularization method for improving the adversarial robustness of convolutional neural networks (CNNs). Although CNNs can accurately recognize images, recent studies have shown that the convolution operations in CNNs commonly have structural sensitivity to specific noise composed of Fourier basis functions; by exploiting this sensitivity, these studies proposed a simple black-box adversarial attack, the single Fourier attack. To reduce structural sensitivity, we can regularize the convolution filter weights, since the sensitivity of a linear transform can be assessed by the norm of its weights. However, standard regularization methods can prevent minimization of the loss function because they impose a tight constraint to obtain high robustness. To solve this problem, Absum imposes a loose constraint: it penalizes the absolute values of the summation of the parameters in the convolution layers. Absum can improve robustness against the single Fourier attack while being as simple and efficient as standard regularization methods (e.g., weight decay and L1 regularization). Our experiments demonstrate that Absum improves robustness against the single Fourier attack more than standard regularization methods. Furthermore, we reveal that CNNs trained with Absum are more robust than those trained with standard regularization methods against transferred attacks, owing to a decrease in the common sensitivity, and against high-frequency noise. We also reveal that Absum can improve robustness against gradient-based attacks (projected gradient descent) when used together with adversarial training.
Comment: 16 pages, 39 figures
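The penalty itself is simple to write down. The sketch below assumes PyTorch and assumes the summation is taken over the spatial dimensions of each convolution kernel (one natural reading of "the summation of the parameters in the convolution layers"); the coefficient `lam` is a placeholder. It is a sketch of the penalty's shape, not a verified reproduction of Absum.

```python
import torch
import torch.nn as nn

def absum_penalty(model: nn.Module, lam: float = 1e-4):
    """Illustrative Absum-style penalty; `lam` and the summation axes are assumptions."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # Weight shape: (out_channels, in_channels, kH, kW).
            # Sum each kernel over its spatial dimensions, then penalize |sum|,
            # instead of penalizing every weight as weight decay or L1 would.
            kernel_sums = module.weight.sum(dim=(2, 3))
            penalty = penalty + kernel_sums.abs().sum()
    return lam * penalty

# Typical use: total_loss = task_loss + absum_penalty(model)
```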
Fast Lasso Algorithm via Selective Coordinate Descent
For the AI community, the lasso proposed by Tibshirani is an important regression approach for finding explanatory predictors in high-dimensional data. The coordinate descent algorithm is a standard approach to solving the lasso; it iteratively updates the weights of the predictors in a round-robin fashion until convergence. However, it has a high computation cost. This paper proposes Sling, a fast approach to the lasso. It achieves high efficiency by skipping unnecessary updates for predictors whose weights are zero during the iterations. Sling can obtain high prediction accuracy with fewer predictors than the standard approach. Experiments show that Sling can enhance both the efficiency and the effectiveness of the lasso.
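A minimal sketch of the underlying idea, written against standard lasso coordinate descent with soft-thresholding: coordinates whose weights are currently zero are skipped on most sweeps and only revisited occasionally. The `recheck_every` schedule below is a naive placeholder, not the update-skipping rule Sling actually uses.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd_selective(X, y, lam, n_sweeps=100, recheck_every=10):
    """Lasso by coordinate descent, skipping zero-weight coordinates on most sweeps.

    Objective: (1/(2n))||y - Xw||^2 + lam*||w||_1, with standardized columns of X.
    `recheck_every` is a naive stand-in for Sling's selective-update rule.
    """
    n, d = X.shape
    w = np.zeros(d)
    residual = y.copy()                     # r = y - X @ w
    col_sq = (X ** 2).sum(axis=0)
    for sweep in range(n_sweeps):
        full_sweep = (sweep % recheck_every == 0)
        for j in range(d):
            if w[j] == 0.0 and not full_sweep:
                continue                    # skip predictors currently out of the model
            residual += X[:, j] * w[j]      # remove coordinate j's contribution
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam * n) / col_sq[j]
            residual -= X[:, j] * w[j]      # add the updated contribution back
    return w
```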
Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks
Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with only a few data points. The main challenge is avoiding overfitting, since over-parameterized NNs can easily overfit to such small datasets. Previous work (e.g., MAML by Finn et al., 2017) tackles this challenge by meta-learning, which learns how to learn from a few data points by using various tasks. On the other hand, a conventional approach to avoiding overfitting is to restrict the hypothesis space by imposing sparse NN structures, such as convolution layers in computer vision. However, although such manually designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. The following questions then naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will this bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, that finds optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validate that Meta-ticket successfully discovers sparse subnetworks that learn specialized features for each given task. Owing to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods, especially with large NNs.
Comment: Code will be available at https://github.com/dchiji-ntt/meta-ticke
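One way to make "finding a subnetwork inside a randomly initialized network" concrete is the score-and-mask construction below: the weights stay frozen at their random initialization while trainable scores select, via a top-k mask with a straight-through gradient, which connections are kept. The `MaskedLinear` class, the `sparsity` value, and the edge-popup-style masking are assumed illustrations of the general mechanism, not Meta-ticket's exact parameterization; Meta-ticket additionally meta-learns the scores across few-shot tasks in a MAML-style inner/outer loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights; only the mask scores are trained."""

    def __init__(self, in_features, out_features, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1,
                                   requires_grad=False)   # frozen random weights
        self.scores = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.sparsity = sparsity

    def forward(self, x):
        # Keep the top-k scoring connections; k is set by the sparsity level.
        k = int(self.scores.numel() * (1.0 - self.sparsity))
        threshold = self.scores.flatten().kthvalue(self.scores.numel() - k + 1).values
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: hard mask in the forward pass,
        # identity gradient to the scores in the backward pass.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)
```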